Lecture 23: Cloud OLTP

Let’s start with the hidden assumptions in Aurora

Recall our Cloud Change list from OLAP. What’s different for OLTP?

  1. Delegate components to good-enough subsystems, and their dev and ops teams
    1. E.g. storage in S3? No! (why not?)
    2. E.g. cluster management in k8s?
  2. Elasticity is available: what shall we use it for and how? Impact of single-master
  3. Need to address global reach and geolatencies.
  4. Lots of shared work across queries/users
  5. Diversity of HW
  6. SLOs. E.g. “best effort” vs “reserved”
  7. Security concerns: Enclaves?
  8. Learn from workloads
  9. IaaS pricing models and impact on design

Many Nice Lessons from the Aurora Paper

Required reading! In fact, so well-written and full of sweet sense that it almost “hides the ball” of its Achilles’ Heel(s)!

BUT, BUT, BUT… it’s a single-master non-serializable design! Sigh.

PolarDB-SCC: Serializable Reads for Aurora-Style Replication using RDMA

  1. Don’t make reads wait for irrelevant data by applying entire logs sequentially; instead, track the max LSN on fine-grained subsets
    • Hierarchical: global, per-table, per-page max timestamps (LSNs)
    • check top-down; if the check succeeds at any level, serve the request
      • else wait for relevant pages to arrive
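    • A hedged sketch of that top-down check (names are mine, not PolarDB-SCC’s), assuming the tracked values are last-modification LSNs and `applied_lsn` is how far this RO node has applied the log:

      ```python
      class ModificationTracker:
          """Max (latest) modification LSN at three granularities."""
          def __init__(self):
              self.global_max = 0     # newest modification LSN anywhere
              self.table_max = {}     # table_id -> newest modification LSN in that table
              self.page_max = {}      # page_id  -> newest modification LSN on that page

      def can_serve_read(tracker, applied_lsn, table_id, page_id):
          # Top-down: success at any level means this RO replica is fresh enough.
          if tracker.global_max <= applied_lsn:
              return True             # nothing newer than applied_lsn exists at all
          if tracker.table_max.get(table_id, 0) <= applied_lsn:
              return True             # this table has no modifications past applied_lsn
          if tracker.page_max.get(page_id, 0) <= applied_lsn:
              return True             # this page has no modifications past applied_lsn
          return False                # wait for the page's log to arrive and be applied
      ```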
  2. Use “Linear Lamport” timestamps to avoid fetching the RW node’s timestamp on every request
    • RW node is the “timestamp oracle”
    • Can reuse a timestamp fetch for transactions that “happened-before” the fetch request
      • e.g. in the diagram below:

        ```mermaid
        flowchart LR
          subgraph RO node
            subgraph request 1
              t_1["r_1 arrives"]
              t_4["r_1 needs TS, can reuse TS3rw!"]
            end
            subgraph request 2
              t_2["r_2 fetch TS"]
              t_3["TS response"]
            end
            t_1 --> t_2
          end
          subgraph RW node
            TS3rw
          end
          t_2 --> TS3rw --> t_3
          t_3 --> t_4
        ```

    • special case: a request arrives before $r_2$’s fetch, but needs a TS between $r_2$’s fetch request and its response
      • make it wait for $r_2$’s response!
    • The happens-before tracking is done by caching the pair <$TS_{rw}$, $TS_{ro}$>
      • If a request arrived before $TS_{ro}$, reuse the cached $TS_{rw}$
    • Note: now at most one TS fetch per transaction per RO node misses the cache.
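    • A minimal sketch of the reuse rule, assuming $TS_{ro}$ is the local (monotonic) clock reading when the fetch was issued and `fetch_from_rw` stands in for the RDMA timestamp read (both names are mine):

      ```python
      import threading
      import time

      class TimestampCache:
          """Caches the pair <ts_rw, ts_ro>: the last timestamp fetched from the
          RW node, and the local time at which that fetch was issued."""

          def __init__(self, fetch_from_rw):
              self.fetch_from_rw = fetch_from_rw   # stand-in for the RDMA timestamp read
              self.lock = threading.Lock()
              self.ts_rw = None
              self.ts_ro = None

          def get_read_ts(self, arrival_time):
              """Pick a read timestamp for a request that arrived locally at arrival_time."""
              with self.lock:
                  if self.ts_ro is not None and arrival_time <= self.ts_ro:
                      return self.ts_rw            # request happened-before the fetch: reuse
              # Cache miss: issue a fresh fetch and remember when we asked.
              # (Reusing a fetch that is still in flight, i.e. the special case
              # above of waiting on its response, is omitted for brevity.)
              issued_at = time.monotonic()
              ts = self.fetch_from_rw()
              with self.lock:
                  if self.ts_ro is None or issued_at > self.ts_ro:
                      self.ts_rw, self.ts_ro = ts, issued_at
              return ts
      ```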
  3. RDMA for log-shipping/timestamp-fetching
    • low latency
    • sidesteps the CPU at RW node!
    • Doesn’t seem to address Aurora’s cross-AZ desires?!
    • Detail: hashtable of hierarchical timestamps needs to be at a fixed location/size in memory
      • one TS per hash value; collisions OK, keep max (makes readers more conservative)
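      • A hedged sketch of such a table (size and names are mine): one max LSN per slot at a fixed size, so a remote reader can locate any slot at a known offset, and collisions only make readers more conservative:

        ```python
        NUM_SLOTS = 1 << 20                     # fixed size: remote readers know the layout

        class PageTimestampTable:
            def __init__(self):
                self.slots = [0] * NUM_SLOTS    # one max modification LSN per hash bucket

            def record_write(self, page_id, lsn):
                i = hash(page_id) % NUM_SLOTS
                # Collisions are OK: keep the max over all pages in this slot.
                if lsn > self.slots[i]:
                    self.slots[i] = lsn

            def last_modified(self, page_id):
                # May over-estimate because of collisions: conservative but safe.
                return self.slots[hash(page_id) % NUM_SLOTS]
        ```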
    • Remote logging a bit complicated
      • RW node has a fixed-size ring buffer for log, one “log writer” thread per RO node
      • need to rate-match:
        1. the buffer at RW node
        2. the buffers at the RO nodes
      • if the rate-matching fails, RO node can fetch log from shared cloud storage (!!)
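      • A rough sketch of that fallback path (the interfaces here, e.g. `fetch_from_cloud_storage`, are hypothetical): once a lagging RO node’s next record has been overwritten in the ring, it has to read that range from shared storage:

        ```python
        from collections import namedtuple

        LogRecord = namedtuple("LogRecord", ["lsn", "payload"])

        class LogRingBuffer:
            """Fixed-size ring of recent log records on the RW node."""
            def __init__(self, capacity):
                self.capacity = capacity
                self.buf = [None] * capacity
                self.next_lsn = 1                    # LSN the next append will get

            def append(self, payload):
                lsn = self.next_lsn
                self.buf[lsn % self.capacity] = LogRecord(lsn, payload)
                self.next_lsn += 1
                return lsn

            def read_from(self, lsn):
                """Records starting at lsn, or None if they were already overwritten."""
                oldest = max(1, self.next_lsn - self.capacity)
                if lsn < oldest:
                    return None                      # rate-matching failed
                return [self.buf[i % self.capacity] for i in range(lsn, self.next_lsn)]

        def ship_log_once(ring, ro_node, fetch_from_cloud_storage):
            """One iteration of a per-RO-node 'log writer' loop."""
            start = ro_node.applied_lsn() + 1
            records = ring.read_from(start)
            if records is None:
                # The RO node lagged past the ring buffer: fall back to shared
                # cloud storage for the missing range (!!)
                records = fetch_from_cloud_storage(start)
            ro_node.apply(records)
        ```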
  4. Read-your-writes across SQL statements:
    • RW node returns an LSN for each write to the proxy
    • Proxy makes sure to route reads to a node that’s past that LSN (RO or in worst case RW node)
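    • A hedged sketch of the proxy’s routing rule (the node interfaces are invented for illustration):

      ```python
      class Proxy:
          """Routes each session's reads only to nodes past its last write LSN."""
          def __init__(self, rw_node, ro_nodes):
              self.rw_node = rw_node
              self.ro_nodes = ro_nodes
              self.session_lsn = {}                    # session_id -> LSN of last write

          def execute_write(self, session_id, stmt):
              # Assumed interface: the RW node returns the LSN of each write.
              result, commit_lsn = self.rw_node.execute_write(stmt)
              self.session_lsn[session_id] = commit_lsn
              return result

          def execute_read(self, session_id, stmt):
              need = self.session_lsn.get(session_id, 0)
              for ro in self.ro_nodes:
                  if ro.applied_lsn() >= need:         # replica has applied that write
                      return ro.execute_read(stmt)
              return self.rw_node.execute_read(stmt)   # worst case: read on the RW node
      ```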

What about Spanner/Cockroach?

Yes, you can implement/use 2PC! The existence of Spanner is a nice proof point, but it’s debatable whether it’s a nice system.
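To make “you can implement 2PC” concrete, here is a bare-bones coordinator sketch (the participant and log interfaces are hypothetical); Spanner layers this on Paxos-replicated participant groups with TrueTime timestamps and handles the failure cases this sketch ignores:

```python
def two_phase_commit(coordinator_log, participants, txn_id):
    # Phase 1: every shard prepares and votes (durably logging its vote first).
    votes = [p.prepare(txn_id) for p in participants]
    decision = "commit" if all(votes) else "abort"
    coordinator_log.write(txn_id, decision)   # the decision itself must be durable
    # Phase 2: tell everyone the outcome.
    for p in participants:
        if decision == "commit":
            p.commit(txn_id)
        else:
            p.abort(txn_id)
    return decision
```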